E3: an Elastic Execution Engine for Scalable Data Processing

نویسندگان

  • Gang Chen
  • Ke Chen
  • Dawei Jiang
  • Beng Chin Ooi
  • Lei Shi
  • Hoang Tam Vo
  • Sai Wu
چکیده

With the unprecedented growth of data generated by mankind nowadays, it has become critical to develop efficient techniques for processing these massive data sets. To tackle such challenges, analytical data processing systems must be extremely efficient, scalable, and flexible as well as economically effective. Recently, Hadoop, an open-source implementation of MapReduce, has gained interests as a promising big data processing system. Although Hadoop offers the desired flexibility and scalability, its performance has been noted to be suboptimal when it is used to process complex analytical tasks. This paper presents E3, an elastic and efficient execution engine for scalable data processing. E3 adopts a “middle” approach between MapReduce and Dryad in that E3 has a simpler communication model than Dryad yet it can support multi-stages job better than MapReduce. E3 avoids reprocessing intermediate results by adopting a stage-based evaluation strategy and collocating data and user-defined (map or reduce) functions into independent processing units for parallel execution. Furthermore, E3 supports block-level indexes, and built-in functions for specifying and optimizing data processing flows. Benchmarking on an in-house cluster shows that E3 achieves significantly better performance than Hadoop, or put it another way, building an elastically scalable and efficient data processing system is possible.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Using Tcl to Rapidly Develop a Scalable Engine for Processing Dynamic Application Logic

At Healtheon, we used Tcl to rapidly develop a scalable, high performance rule engine for processing dynamic application logic. The nature of our application requirements, plus the challenge of delivering robust software in a timely manner made Tcl an optimal overall choice in our deployment. We were able to improve rule processing performance by careful language construction and support for co...

متن کامل

Elastic and Scalable Processing of Linked Stream Data in the Cloud

Linked Stream Data extends the Linked Data paradigm to dynamic data sources. It enables the integration and joint processing of heterogeneous stream data with quasi-static data from the Linked Data Cloud in near-real-time. Several Linked Stream Data processing engines exist but their scalability still needs to be in improved in terms of (static and dynamic) data sizes, number of concurrent quer...

متن کامل

Elastic Processing of Analytical Query Workloads on IaaS Clouds

Many modern applications require the evaluation of analytical queries on large amounts of data. Such queries entail joins and heavy aggregations that often include user-defined functions (UDF s). The most efficient way to process these specific type of queries is using tree execution plans. In this work, we develop an engine for analytical query processing and a suite of specialized techniques ...

متن کامل

Ivanova Scalable Scientific Stream Query Processing

Ivanova, M. 2005. Scalable Scientific Stream Query Processing. Acta Universitatis Upsaliensis. Uppsala Dissertations from the Faculty of Science and Technology 66. 137 pp. Uppsala. ISBN 91-554-6351-7 Scientific applications require processing of high-volume on-line streams of numerical data from instruments and simulations. In order to extract information and detect interesting patterns in thes...

متن کامل

Distributed Query Processing on the Cloud: the Optique Point of View (Short Paper)

The Optique European project 3 [6] aims at providing an end-to-end solution for scalable access to Big Data integration, where end users will formulate queries based on a familiar conceptualization of the underlying domain. From the users’ queries the Optique platform will automatically generate appropriate queries over the underlying integrated data, optimize and execute them on the Cloud. In ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • JIP

دوره 20  شماره 

صفحات  -

تاریخ انتشار 2012